System and method for highly scalable real-time and time-based data delivery using server clusters

ABSTRACT

The present invention provides loosely-coupled cluster systems comprising a plurality of servers based on storage attached to the plurality of servers. Videos, or other assets, are automatically replicated within the server system to increase the number of concurrent play requests serviceable. The server systems can detect spikes in demand that may exceed the guaranteed number of concurrent play requests serviceable and dynamically transfer the high-in-demand or ‘hot’ asset to servers in the cluster that do not have the video. Alternatively, instead of transferring the entire asset, varying length prefixes of the asset may be transferred depending on the availability of resources. The remainder of the asset is transferred in some embodiments on demand with sufficient buffering or other storage to guarantee playback to the user or subscriber according to the required quality of service (QOS).

RELATED APPLICATIONS

This is a continuation of and claims priority to U.S. application Ser. No. 10/205,476 filed Jul. 24, 2002 entitled “System And Method For Highly Scalable Real-Time And Time-Based Data Delivery Using Server Clusters,” and is related to U.S. application Ser. No. 12/038,798 filed Feb. 27, 2008, which is itself a divisional of U.S. application Ser. No. 10/205,476 filed Jul. 24, 2002, both of which are incorporated by reference herein.

FIELD OF THE INVENTION

The invention relates generally to server systems and methods for serving content, and more particularly to server systems and methods that facilitate real-time and time-based media streaming and hot-spot or high-demand asset management, particularly for streaming DVD quality video content assets.

BACKGROUND

In order to deliver (or stream) real-time or time-based data from a server system to an end-user system, a number of system resources must be tightly managed. Typically, a video server system comprises video server hardware and software, while an end-user system refers to a set-top box and TV, Personal Computer (PC), or other user device. Resources that must be tightly managed include Input/Output (I/O) resources such as disk drive (or other storage media) space and disk drive (or other storage media) bandwidth, CPU resources, memory, and network bandwidth.

Real-time and/or time-based media streaming, such as video streaming or video-on-demand (for example, movie, music, or other multi-media on-demand on a set-top box or other device connected to a television set or other receiver), is an extremely cost-sensitive business.

Because of the bandwidth required to deliver a high quality video stream (typically 3 to 8 Megabits/second/user), these applications place tremendous load on the video server's memory, disk (or other storage media), and network subsystems. When such an application scales from serving a few users (for example, tens to hundreds) to very large numbers of users (for example, hundreds of thousands or millions), the total solution cost, using today's technologies, becomes cost-prohibitive. Business economics, for example, may initially benefit from a small low cost system that can service a limited number of users or subscribers. As the number of users or subscribers grows, the initial system is augmented to add additional capacity. Desirably, the initial system is retained and the initial system architecture is retained and scaled to serve the larger set of users.

Typical video-on-demand deployments start small and grow. A small server system capable of serving a few hundred users eventually must become part of a larger system that serves hundreds of thousands. Heretofore, two approaches have generally been taken to address this system size or system capacity scaling problem: (1) deployment and use of tightly-coupled multiprocessor systems delivering a large number of streams, and (2) loosely coupled clusters that are composed of small, off-the-shelf computers connected using standard computer networks.

Examples of these types of configuration are described relative to FIG. 1 and FIG. 2. With reference to FIG. 1, there is illustrated a portion of one embodiment of a tightly-coupled multiprocessor system, server 50, delivering a large number of streams. Server 50 has the capacity for a large number of processors, usually embodied as processor boards. Accordingly, server 50 comprises a plurality of slots, such as slots 60, 62, 64, 66, and 68. In one embodiment, server 50 has 256 slots, and is therefore capable of comprising 256 processor boards. Typically, server 50 begins service with a few processor boards, such as boards 70, 72, and 74, and boards are added as the system grows. Such a system tends to be very costly and does not usually meet the strict cost constraints placed by business. There is also the potential for failure of one board, such as processor board 72, to cause total failure of server 50. Further, as the system grows, the cost of computational power decreases, and the processor boards required to update the system may be outdated by the time a system administrator is prepared to grow the server system.

Examples of the loosely coupled clusters that are composed of small, off-the-shelf computers connected using standard networks may, for example, use Gigabit Ethernet or Fiber Channel networking and use software to manage the collection of systems as a single entity capable of meeting some scalability and quality of service requirements. An exemplary system according to this loosely coupled cluster concept is illustrated in FIG. 2. FIG. 2 depicts servers 80 and 82 operating together as a cluster, receiving requests from load balancer 79 (a Layer 4 switch). Servers 80 and 82 each have access to all assets—including asset 86, asset 88, and asset 90—through fiber-channel switch 84. The shared storage includes additional components—fiber-channel switches, switch adapters, disks that are fiber-channel capable, etc. All are additional cost components and add complexity to the scalability of the network.

In addition, the shared storage cluster shown in FIG. 2 does not solve the resource management problem. For example, a video stored on a disk attached to a shared fiber channel switch still has its limitations on the amount of bandwidth available from the disk or through a fiber channel link. Thus, if a particular asset, or video, becomes in high-demand or is “hot” (where a lot of subscribers are requesting the video simultaneously, exceeding any disk's capacity to serve it or any one server's capacity, in terms of disk or network bandwidth, to serve it), additional mechanisms are required to handle it. Many conventional systems attempt to copy high-demand or ‘hot’ assets onto switch memory 84 or server physical memory for faster access. However, these schemes fail beyond a certain size file or asset, as the system resource requirements become prohibitive for large video files.

Further, conventional load balancing handles requests from client devices and spreads them across various servers to effectively balance network bandwidth as well as connection overheads (usually in software). However, the present solutions fail to take into account the I/O problem—the problem that happens at the I/O subsystem, where contention for a video file, or for storage system video file retrieval bandwidth, causes the disk subsystem to run out of resources.

This input/output problem is endemic to any time-based media (such as audio and video) and real-time content delivery, and is especially true for “high-quality” or “high-value” video content. For example, a typical movie for a movie-on-demand application generally needs to be delivered at 4 Mbps to 8 Mbps today, and up to 20 Mbps for a high-definition (HD) system, over a period of 90 to 120 minutes. For such an application, continued availability of resources—such as disk or other storage subsystem bandwidth, memory, network bandwidth, and CPU resources—over a long period of time is required to deliver a video service. Customers simply will not subscribe to a paid service to see a full length movie at lower than broadcast quality, and may not even be inclined to subscribe unless the movie is the quality of a DVD or equivalent movie.

This is in contrast to existing load balancing/cluster systems for solving computational problems or data delivery problems (such as serving web pages from a server cluster at an aggregation site). Computational clusters usually tax the disk subsystems very little, whereas data clusters for non-time-based data (such as graphics images or web pages) tax the disk subsystem but do not have real-time delivery semantics associated with them. For example, users will generally tolerate parts of a web-page loading slowly, whereas breakups in audio and video are considered less tolerable or intolerable. Subscribers simply will not subscribe to a video (movie) delivery service where the play is broken or erratic in time, or the required frame-rates (typically 24 or 30 frames/second) cannot be maintained.

A single copy of a video on a server's disk subsystem can only service a certain number of concurrent play requests. This number is typically limited by the hard disk's bandwidth. For example, if a disk provides 30 Megabytes/second of bandwidth for read/write access, it can support delivery of videos encoded at 5 Megabits/second to 48 users concurrently ((30 Megabytes/second×8 bits/byte)/(5 Megabits/second)=48 concurrent streams). Striping techniques, where a file system is built on top of a number of such disks, increase the number of concurrent users. However, there is an upper limit to the number of concurrent users the subsystem can serve. When a video (or other content) becomes “popular”, more copies of that video need to be provided to increase the concurrent number of plays available given the disk drive bandwidth. (Note that this disk drive bandwidth requirement is entirely different from disk drive storage capacity.) If the relative popularity of the video is known, a predetermined number of copies can be provided. However, dynamic spikes in interest or demand for a particular video movie or other real-time deliverable video content item may occur in a real-time streaming system.
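By way of illustration, the arithmetic above can be captured in a few lines. The following minimal Python sketch reproduces the calculation; the function and parameter names are illustrative only and are not part of the described system.

    def concurrent_streams(disk_mbytes_per_sec, stream_mbits_per_sec):
        # Convert disk bandwidth to megabits/second, then divide by the
        # per-stream encoding rate.
        return int((disk_mbytes_per_sec * 8) / stream_mbits_per_sec)

    print(concurrent_streams(30, 5))  # prints 48, as in the example above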

Accordingly, there is a need in this art for a scalable server system, method, architecture, and topology that is able to cost-effectively, timely, and easily increase the number of users serviceable. Such a system should be viable for time-based media delivery, including streaming of broadcast, DVD, and HD movie quality video.

There is a further need in this art for a server system, method, architecture, and topology capable of managing system resources and load balancing to effectively provide real-time asset streaming, including streaming of broadcast and DVD movie quality video assets. Management of resources would extend to disk management, CPU management, memory management, and network bandwidth management.

There is still a further need in this art for a server system, method, architecture, and topology capable of dynamically adjusting to content delivery service demand in a real-time system. That is, a server system capable of automatically and dynamically increasing its capacity for playing out a specific asset, such as a specific video movie, when demand for that asset increases.

SUMMARY

The invention provides system, apparatus, method, computer program and computer program product, and business method and model for distribution of media assets to users or subscribers. The inventive system and method are highly scalable architecturally and on a dynamic demand basis.

In one aspect the present invention provides loosely-coupled cluster systems comprising one or a plurality of servers based on storage attached to the server(s). In another aspect, videos, or other assets, are automatically replicated within the server system to increase the number of concurrent play requests serviceable. In another aspect, the server systems can detect spikes in demand that may exceed the guaranteed number of concurrent play requests serviceable and dynamically transfer the high-in-demand or ‘hot’ asset to servers in the cluster that do not have the asset. Alternatively, instead of transferring the entire asset, varying length prefixes of the asset may be transferred depending on the availability of resources. The remainder of the asset is transferred in some embodiments on demand with sufficient buffering or other storage to guarantee playback to the user or subscriber according to the required quality of service (QOS).

In one embodiment, the invention provides a server system for time-based media streaming comprising: a plurality of servers coupled for communication with each other, including a first server and second server, the first server comprising: a first computer-readable storage medium encoded with stored server information comprising asset information associated with the second server; a first computer-readable storage device associated with the first server encoded with first asset information; and a second computer-readable storage device associated with the second server encoded with second asset information.

In another embodiment, the invention provides a method for time-based streaming of assets, the method including: receiving a request for an asset at a first server; determining if the first server has the asset; determining if the first server has sufficient resources to stream the asset; streaming the asset while maintaining a time-base for the streamed asset if the first server has the asset and the first server has sufficient resources to stream the asset; and

if the first server does not have the asset, or the first server does not have sufficient resources to stream the asset, attempting to identify a second server having the asset and sufficient resources to stream the asset; and forwarding the request to the identified second server.

In another embodiment, the invention provides a method for time-based streaming of assets and load-balancing, the method including: receiving a request for an asset at a first server having the asset and sufficient resources to stream the asset; streaming the asset while maintaining a time-base for the streamed asset if the first server has a first server load level less than a load threshold value; and if the first server has a load level greater than a load threshold level, the method further including: attempting to find a second server having the asset, sufficient resources to stream the asset, and a second server load level less than the first server load level; forwarding the request if the second server is located; and streaming the asset while maintaining a time-base for the streamed asset if the second server is not located.

In another embodiment, the invention provides a method for time-based streaming of assets, the method including: receiving a request for an asset at a first server; determining if the first server has the asset; determining if the first server has sufficient resources to stream the asset; and if the first server does not have the asset or the first server does not have sufficient resources to stream the asset, forwarding the request to a second server having the asset and sufficient resources to stream the asset; and if the first server has the asset and sufficient resources to stream the asset, determining if the first server has a load level less than a load threshold value; and if the first server has a first server load level less than a load threshold value, streaming the asset and maintaining a time-base for the streamed asset; and if the first server has a load level greater than a load threshold level, attempting to find a second server having the asset, sufficient resources to stream the asset, and a second server load level less than the first server load level; forwarding the request if the second server is located; and streaming the asset and maintaining a time-base for the streamed asset if the second server is not located.
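The combined decision flow of the methods above may be sketched in a few lines of Python. This is an illustrative sketch only; server objects with has_asset, has_resources, and load attributes are assumptions introduced here for clarity and are not named in the description.

    def route_request(first, asset, cluster, load_threshold):
        # Forward if the first server lacks the asset or resources.
        if not (first.has_asset(asset) and first.has_resources(asset)):
            return next((s for s in cluster
                         if s.has_asset(asset) and s.has_resources(asset)),
                        None)
        # Stream locally when under the load threshold.
        if first.load < load_threshold:
            return first
        # Overloaded: prefer a less-loaded peer that also has the asset;
        # stream locally anyway if no such peer is found.
        peer = next((s for s in cluster
                     if s.has_asset(asset) and s.has_resources(asset)
                     and s.load < first.load), None)
        return peer if peer is not None else first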

The invention further provides various computer programs and computer program products adapted for execution on general purpose computers, servers, and information systems.

The invention also provides a business model and method for distribution of content and assets (such as video movies) as well as a business model and method for operating and growing a scalable content and asset distribution system.

In another embodiment, the invention provides a business model for operating a time-base accurate asset streaming business, the business model comprising: operating a first server to receive and service requests for an asset, the first server (i) receiving a request for an asset, (ii) determining if the first server has the asset available for time-base accurately streaming and has sufficient resources to time-base accurately stream the asset, and (iii) time-base accurately streaming the asset if it is determined that the first server has the asset available for time-base accurately streaming and has sufficient resources to time-base accurately stream the asset; and if the determining indicates that the first server does not have the asset available for time-base accurately streaming or does not have sufficient resources to time-base accurately stream the asset, then: (i) identifying a second server having the asset available for time-base accurately streaming and sufficient resources to time-base accurately stream the asset, and (ii) forwarding the request to the identified second server for servicing by the second server.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention may be better understood, and its features and advantages made apparent to those skilled in the art, by referencing the accompanying drawings.

FIG. 1 is a diagrammatic illustration showing an embodiment of a tightly-coupled multiprocessor system, as known in the art.

FIG. 2 is a diagrammatic illustration showing an embodiment of a general architecture of a loosely-coupled server system with a fiber switch, as known in the art.

FIG. 3 is a diagrammatic illustration showing an embodiment of a cluster system with direct attached storage, according to an embodiment of the present invention.

FIG. 4 is a diagrammatic illustration showing an embodiment of a cluster system with shared storage, according to an embodiment of the present invention.

FIG. 5 is a diagrammatic illustration showing an embodiment of a cluster system with hierarchical storage, according to an embodiment of the present invention.

FIG. 6 is a diagrammatic illustration showing an embodiment of an Intra Cluster Protocol message format, according to an embodiment of the present invention.

FIG. 7 is a diagrammatic illustration showing an embodiment of an activation process, according to an embodiment of the present invention.

FIG. 8 is a diagrammatic illustration showing an embodiment of a method for calculating indices in a Summary Cache, according to an embodiment of the present invention.

FIG. 9 is a schematic overview of one embodiment of a request forwarding procedure for a server in a cluster, according to one embodiment of the present invention.

FIG. 10 is a diagrammatic illustration showing a control flow when a play request is forwarded using RTSP, according to an embodiment of the present invention.

FIG. 11 is an illustration of a graphical appearance of one aspect of the cluster management console, according to an embodiment of the present invention.

FIG. 12 schematically depicts a process through which events (traps) are propagated to the Cluster Console, according to an embodiment of the present invention.

FIG. 13 depicts a Load Monitor displayable by the Console, according to an embodiment of the present invention.

FIG. 14 depicts a stream monitor displayable on the console, according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Generally, the present invention provides loosely-coupled cluster systems comprising a plurality of servers based on storage directly attached to the plurality of servers. Videos, music, multi-media content, or other assets, are replicated within the server system to increase the number of concurrent play requests serviceable for the videos, music, multi-media content, or other assets. For convenience these various videos, movies, music, multi-media content or other assets are referred to as video or movies, as these are the most prevalent types of assets; however, it should be clear that a reference to any one of these asset or content types, such as to video or movies, refers to each of these other types of content or asset as well.

In some embodiments, the server systems detect spikes in demand that may exceed the guaranteed number of concurrent play requests serviceable. In some embodiments, the server systems dynamically replicate the ‘hot’, high-demand or frequently requested asset to servers in the cluster that do not have the video. (High-demand or frequently requested assets are conveniently referred to as “hot” or as “hot-assets” in this description.) Alternatively, instead of replicating the entire asset, varying length “prefixes”, or initial portions, of the asset may be replicated depending on the availability of resources. The remainder of the asset is transferred in some embodiments on demand with sufficient buffering or other storage to guarantee playback to the user or subscriber according to the required quality of service (QOS). Assets as used herein generally refers to data files. Assets stored on, and streamed by, server systems discussed herein preferably comprise real-time or time-based assets, and more preferably comprise video movies or other broadcast, DVD, or HD movie quality content, or multi-media having an analogous video movie component. It will also be appreciated that, as new and different high-bandwidth content assets are developed, such high-bandwidth content assets benefiting from real-time or substantially real-time play may also be accommodated by the inventive system and method.

Accordingly, the present invention provides a server system, method, architecture, and topology for real-time and time-base accurate media streaming. The terms real-time and time-base or time-base accurate are generally used interchangeably in this description, as real-time play generally means that streaming or delivery is time-base accurate (it plays at the designated play rate) and is delivered according to some absolute time reference (that is, there is not too much delay between the intended play time and the actual play time). In general, real-time play is not required relative to a video movie, but real-time play or substantially real-time play may be required or desired for a live sporting event, awards ceremony, or other event where it would not be advantageous for some recipients to receive the asset with a significant delay relative to other recipients. For example, it is desirable that all requesting recipients of a football game would receive a time-base accurate rendering or play out, and that the delay experienced by any recipient be not more than some predetermined number of seconds (or minutes) relative to another requesting recipient. The actual time-delay for play out relative to the live event may be any period of time where the live event was recorded for such later play. In one embodiment, a requestor selecting such event asset play during delayed live play out may choose between beginning play at the start of the asset or joining the asset play synchronized with the play to other requesting recipients.

Streaming, as used herein, generally refers to distribution of data. Aspects of the invention further provide computer program software/firmware and computer program product storing the computer program in tangible storage media. By real-time (or time-based) streaming herein is meant that assets stored by or accessible by the server system are generally transmitted from the server system at a real-time or time-base accurate rate. In other words, the intended play or play out rate for an asset is maintained precisely or within a predetermined tolerance. Generally, for movie video streaming using compression technology available today from the Motion Pictures Expert Group (MPEG), a suitable real-time or time-base rate is 4 to 8 Megabits/second, transmitted at 24 or 30 frames/second. Real-time or time-base asset serving maintains the intended playback quality of the asset. It will be appreciated that, in general, service or play of an ordinary Internet web page or video content item will not be real-time or time-base accurate, and such play may appear jerky with a variable playback rate. Even where Internet playback for short video clips of a few to several seconds duration may be maintained, such real-time or time-base accurate playback cannot be maintained over durations of several minutes to several hours.

Server systems according to the present invention may be described as or referred to as cluster systems, architectures, or topologies. That is, the server systems comprise a plurality of servers in communication (electrical, optical, or otherwise) with each other. A variety of servers for use with the present invention are known in the art and may be used, with MediaBase servers made by Kasenna, Inc. of Mountain View, Calif. being particularly preferred. Aspects of server systems and methods for serving media assets are described in co-pending U.S. patent application Ser. No. 09/916,655 filed 27 Jul. 2001 entitled Improved Utilization of Bandwidth in a Computer System Serving Multiple Users; U.S. patent application Ser. No. 08/948,668 filed 14 Oct. 1997 entitled System For Capability Based Multimedia Streaming Over A Network; and U.S. patent application Ser. No. 10/090,697 filed 4 Mar. 2002 entitled Transfer File Format And System And Method For Distributing Media Content; each of which applications is hereby incorporated by reference.

Each server within the server system generally comprises at least one processor and is associated with a computer-readable storage device, such as a disk or an integrated memory or other computer-readable storage media, which stores asset information. Asset information generally comprises all or part of the asset, or metadata associated with the asset, as described more fully below. A plurality of processors, such as two, three, four, five, six, seven, eight, or more processors or microprocessors may be utilized in any given server. Each server within the system further has access to “load” information about other servers within the system, or cluster. Load information is discussed further below. When receiving a request, then, each server can decide whether to serve or play the requested asset itself, or to transfer the request to another server that has the asset. When choosing where to route the request, if the server is going to transfer the request, the server may take into account load information about the other servers, as well as what type of asset information the other servers have (the entire asset, a prefix of the asset, or metadata, and the like). If the server receiving the request does not have the requested asset, it can transfer the request to another server that does have the asset, or request the asset from a shared (or otherwise accessible) storage device. In some embodiments, a system administrator, or other source, may provide a load threshold value, as discussed further below. Servers within the cluster have access to the load threshold value. When a first server receives a request and has a load greater than the load threshold value, it will attempt to locate another, less loaded, server to service the request even if the first server has the asset and is able to service the request.

The present invention further provides methods and systems and computer program and computer program product for hot (or high-demand) asset management. That is, a system administrator, or other source, may provide a hot (or high-demand) asset count and a hot (or high-demand) asset time period. The server system, or cluster, keeps track of the number of requests received for a given asset. If the number of requests exceeds the hot asset count within the hot (or high-demand) asset period, the asset is deemed ‘hot’ or in high-demand, and a server having access to the asset can make a copy onto another server that does not have access to the asset. By ‘have the asset’ herein is generally meant that the server has asset information associated with the requested asset, such as all or a portion of the asset, stored in its direct attached or integrated storage device or memory. Alternatively, a first server, upon determining that an asset is hot, may copy a variable length prefix of an asset to a second server that does not have the asset. Upon receiving a request for that asset, the second server can request the entire asset from the first server. The idea is that the system monitors interest in or demand for the asset, such as a video movie, and when it appears that the interest or demand is such that the demand on the server will exceed its storage device service bandwidth capacity, it creates another service process to provide for the expected demand. Systems, methods, and computer programs according to the present invention are discussed in further detail below.

A server cluster according to embodiments of the present invention comprises a plurality of servers working together to service a request. The plurality of servers may have independent disks, or other computer readable storage devices, or share disks through a file system over a shared storage system, such as network attached storage (NAS) or a storage area network (SAN). Operationally, the cluster may be deployed at the origin site, where the original assets reside, or at an edge where a server is primarily used as a streaming media cache.

In some embodiments, the front end of the cluster is a load-balancing component that directs user requests to one of the servers within the cluster, or system. In preferred embodiments, the load-balancing component comprises a Layer 4 switch. In other embodiments, the load-balancing component comprises a software load balancing proxy or round-robin DNS. These and other load-balancing components are known in the art. In further preferred embodiments of the present invention, no load-balancing component is necessary, and the load-balancing is effectively performed by a server receiving user requests, which forwards or accepts the requests as appropriate, as described further below. In such embodiments, a Layer 2 switch may be provided as an interface to the servers within the cluster. It will be appreciated that the cost of a simple Layer 2 switch is a fraction of the cost of a Layer 4 load-balancer, so that embodiments of the invention provide considerable cost savings and economies over those embodiments requiring external load-balancers.

In a first preferred embodiment, depicted schematically in FIG. 3, cluster system 100 is provided comprising a plurality of servers including server 105, server 110, and server 115. A variety of suitable media servers are known in the art, with MediaBase servers (Kasenna, Inc.; Mountain View, Calif.) being particularly preferred. Servers 105, 110, and 115 each comprise a computer-readable storage medium encoded with a computer program module that, when executed by at least one processor, enables the server to broadcast load information, receive and store load information, and/or provide the load balancing and hot-asset management functionalities described further below. Alternatively, these functionalities may be provided by a plurality of computer program modules. Each server is associated with its own independent storage—computer-readable storage devices 108, 113, and 118, respectively. Servers 105, 110, and 115 are in communication with one another. In system 100, servers 105, 110, and 115 are in communication via local area network (LAN) 120. In other embodiments, servers 105, 110, and 115 are in communication via a LAN for streaming, and have a separate connection (for example, a direct or wireless connection) for messaging amongst each other. In other embodiments, servers 105, 110, and 115 are in communication via a wide area network (WAN). Other communication means and/or protocols may be utilized as are known in the art for coupling computers, networks, network devices, and information systems.

User requests come to cluster 100 as, for example, a hyper-text transport protocol (HTTP) or real time streaming protocol (RTSP) request, although a variety of other protocols known in the art are suitable for forming user requests. The requests are directed via load-balancing component 125, shown as a Layer 4 switch in FIG. 3, to one of the servers in the cluster. In other embodiments, load-balancing component 125 is not present and user requests are received directly by one or a plurality of servers in cluster 100. Media assets reside on local disks including disks 108, 113, and 118. Media assets, as discussed above, are preferably data files requiring real-time delivery, and more preferably video files. Generally any media format may be supported, with MPEG-1, MPEG-2, and MPEG-4 formats being preferred. The cluster replication policy can range from no replication to partial to full replication. Installing an asset into the cluster generally requires an administrator, or other authorized user, to determine which server or servers should host the asset and install the asset on those servers. Adding additional servers preloaded with asset information can increase the throughput of cluster 100.

Accordingly, in one embodiment of cluster 100, by way of example, 1000 media assets are stored (in fact any number of media assets may be stored). If the assets are high quality MPEG-2 format (encoded at 4 Mb/s) movies, and if each asset is 2 hrs in length (a typical full length feature movie), approximately 4.5 gigabytes (GB) of storage is required per movie. The size and length of assets will vary according to the specific asset stored, and the above numbers are given by way of example only. Cluster 100 therefore requires 4.5 terabytes (TB) (4.5 GB×1000) of storage with no replication. Two-way replication would require 9 TB of storage. Accordingly, cluster 100 may comprise 12 servers, each with around 800 GB of direct attached storage, to support two-way replication. Each server would further be required to play out around 42 streams, and the network would be required to have an aggregate serving bandwidth of 2 Gb/s (4 Mb/s×500) to support 500 users. These metrics and storage requirements will vary according to the size and length of stored assets, the encoding rate of the assets, the desired degree of replication, and the desired number of supported users. The above numbers are provided by way of example and are not intended to limit the invention.
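The sizing arithmetic above can be reproduced in a short Python sketch; all values are the example figures from the text, not fixed system parameters.

    titles = 1000          # number of assets
    asset_gb = 4.5         # storage per 2-hour MPEG-2 title, in GB
    replication = 2        # two-way replication
    users = 500            # concurrent streams to support
    rate_mbps = 4          # encoding rate per stream
    servers = 12           # servers in the example cluster

    storage_tb = titles * asset_gb * replication / 1000.0   # 9.0 TB
    aggregate_gbps = users * rate_mbps / 1000.0             # 2.0 Gb/s
    streams_per_server = users / float(servers)             # about 42
    print(storage_tb, aggregate_gbps, round(streams_per_server))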

In another embodiment of the present invention, schematically depicted in FIG. 4, cluster 200 comprises shared storage system 210. Shared storage system 210 may comprise, for example, a network attached storage (NAS) system or a storage area network (SAN). Shared storage system 210 communicates with servers in cluster 200, such as servers 215, 220, and 225, via network connection 230, such as a SAN or data local area network (LAN). In embodiments comprising a SAN, the SAN comprises its own data network. In some embodiments, the SAN data network comprises fiber switches and the like. In other embodiments, network connection 230 comprises other components providing functionality to communicate between shared storage system 210 and servers 215, 220 and 225. As described above, servers 215, 220 and 225 are in electronic communication through, for example, LAN 240. In preferred embodiments comprising a NAS, LAN 240 is the same as LAN 230. In other embodiments comprising a NAS, LAN 240 and LAN 230 are separate networks. In other embodiments, as discussed above, the servers are in direct communication or have a separate wireless connection. In still other embodiments, the servers have one communication network or link for asset transfer and streaming and a second communication network or link for messaging and communication amongst themselves. Servers 215, 220, and 225 each comprise a computer-readable storage medium encoded with a computer program module that, when executed by at least one processor, enables the server to broadcast load information, receive and store load information, and/or provide the load balancing and hot-asset management functionalities described further below. Alternatively, these functionalities may be provided by a plurality of computer program modules. Load-balancing component 250 may pass user requests to servers within cluster 200, as discussed above with reference to FIG. 3. In other embodiments, a load-balancing component is unnecessary and not present.

In cluster 200, assets reside on shared storage system 210. Individual servers, such as servers 215, 220, and 225, store asset metadata locally in direct attached, or integrated, storage. Metadata generally comprises information about an asset, such as a video, including encoding type, bit rate, duration, and/or the like. Installing an asset into cluster 200 generally involves installing the asset on the shared storage system and distributing the metadata associated with the asset to all the servers in the cluster. Generally, any server may be used to install an asset onto the shared storage system and copy the metadata to the rest of the servers in cluster 200.

Using the cluster example given above—providing 1000 high-quality MPEG-2 titles each lasting 2 hours with two-way replication and supporting 500 users—cluster 200 would require 4.5 TB of storage on the shared storage system. Using servers capable of playing out 125 streams, cluster 200 would require 4 servers. Further, the network between clients and servers would require an aggregate bandwidth of 2 Gb/s (4 Mb/s×500). The data network 230 between servers and storage would require a similar bandwidth. The actual required bandwidth, number of servers, and amount of required storage will vary according to the number, type, and length of assets stored, the number of servers utilized in cluster 200, and the desired number of supported users. The above numbers are given only by way of example.

In a third embodiment, shown schematically in FIG. 5, cluster 300 is provided comprising hierarchical storage. In this embodiment, assets reside at centrally administered server cluster 310 (the head end) and streaming occurs at the edges. An edge generally refers to a location in a server system that is closer to an end user. An edge server is a server located at an edge of a network, and an edge cluster is a set of servers located at an edge. Edge streaming clusters, such as clusters 320 and 330, are similar to the direct attached storage embodiment discussed above with regard to FIG. 3. In operation, if an asset is requested and is not found in an edge cluster, the asset is requested from higher levels of storage (i.e. from cluster 310). Cluster 310 and edge clusters 320 and 330 are in communication via a content distribution network, which may be another LAN. In some embodiments, the content distribution network is a WAN or other network connection, and the appropriate protocols and messaging systems are used to facilitate communication between inner clusters and edge clusters. In some embodiments, the content distribution network shares traffic with a network connection between edge servers, or between edge servers and end users. Servers within clusters 320 and 330 each comprise a computer-readable storage medium encoded with a computer program module that, when executed by at least one processor, enables the server to broadcast load information, receive and store load information, and/or provide the load balancing and hot-asset management functionalities described further below. Alternatively, these functionalities may be provided by a plurality of computer program modules.

In cluster 300, any server can generally be used to install an asset. Installation generally involves placing the asset in the headend and installing a metadata entry and a prefix associated with the asset in all the servers in the edge clusters, such as clusters 320 and 330.

Utilizing the cluster example above—providing 1000 high-quality MPEG-2 titles each lasting 2 hours with two-way replication and supporting 500 users—cluster 300 would require 4.5 TB of storage at the headend. At the edges, assuming that each server caches 100 titles and stores a 5 percent prefix of all 1000 titles, each server would require 652.5 GB (100×4.5 GB+900×0.225 GB) of storage for the cache. Assuming that a server can play out 125 streams, cluster 300 would require 4 servers. The network between subscribers and edge clusters would need to have an aggregate bandwidth of 2 Gb/s (4 Mb/s×500). These metrics and storage requirements will vary according to the size and length of stored assets, the encoding rate of the assets, the desired degree of replication, and the desired number of supported users. The above numbers are provided by way of example and are not intended to limit the invention.
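The per-edge-server cache sizing above follows the same pattern; a minimal sketch using the example values from the text:

    cached_titles = 100      # full titles cached per edge server
    total_titles = 1000      # titles in the catalog
    asset_gb = 4.5           # storage per full title, in GB
    prefix_fraction = 0.05   # 5 percent prefix kept for every title

    full_gb = cached_titles * asset_gb                        # 450 GB
    prefix_gb = (total_titles - cached_titles) * asset_gb * prefix_fraction
    print(full_gb + prefix_gb)  # 652.5 GB per edge server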

Choice of cluster configuration—direct attached storage as in cluster 100, shared storage as in cluster 200, or hierarchical storage as in cluster 300—depends on requirements as to cost, number of required streams, and number of supported users. It is anticipated that one configuration, such as cluster 200, may be implemented and later reconfigured into another configuration, such as cluster 300.

The above description recites various configurations of a cluster according to the present invention. Servers within the cluster contain at least one processor, and are configured to perform a variety of functionalities with respect to streaming assets, messaging between servers, and routing requests. These functionalities are generally provided as a service, herein referred to as a node agent (or “nodeagent”), that is embedded as a computer program module encoded in a computer-readable storage medium within a server and executed by one or more processors. The computer program module, or service or node agent as used herein, contains instructions that, when executed, provide the servers with a variety of messaging and/or other performance functionalities. These functionalities are discussed further below. A node agent may be implemented using any of a variety of computer program module protocols or languages as known in the art, with implementation as a Common Object Request Broker Architecture (CORBA™) service being particularly preferred. It is to be understood that a node agent may be implemented in any of the above described cluster embodiments, or the like. Particularly, a node agent may be installed on any, some, or all of servers 105, 110, 115 in FIG. 3, servers 215, 220, and 225 in FIG. 4, and servers within clusters 320 and 330 in FIG. 5.

A node agent generally exports an interface through which other services, or computer program modules, on the server or in communication with the server interact with the node agent. This interface may be any of a variety of interfaces as known in the art, for example, an Internet Inter-Orb Protocol (IIOP) interface. In some embodiments, a plurality of interfaces are exported by the node agent, each interface for communication via a different protocol.

In some embodiments, the node agent further supports a message-based protocol built over the user datagram protocol (UDP), called the Intra Cluster Protocol (ICP), used for exchanging bootstrapping, load, and event notification messages between node agents in a cluster—that is, generally, between servers.

The Intra Cluster Protocol (ICP) is an extension of the Internet Cache Protocol, as known in the art and described further in, for example, “Internet Cache Protocol”, version 2, Wessels, D. and Claffy, K., RFC 2186, September 1997, hereby incorporated by reference herein. The Intra Cluster Protocol is used by the node agent for bootstrapping, load information exchange, asset insert and delete notifications, and failure detection. An embodiment of the Intra Cluster Protocol message format is shown schematically in FIG. 6. Briefly, message 400 comprises header 410 comprising operation code (opcode) field 420, version field 430, and data length field 440. Header 410 is preferably 4 bytes in length, although substantially any length may be chosen and implemented accordingly. Message 400 further comprises data field 450. Some opcodes used in preferred embodiments of message 400 are shown in Table 1. Other standard opcodes are supported in some embodiments, including ICP_INVALID, ICP_QUERY, ICP_HIT, ICP_MISS, and ICP_MISS_NOFETCH.

An I am alive opcode (I_AM_ALIVE) 525 is used to indicate a bootstrap message that is sent to inform servers that a first server is up and running. The message size is preferably 8 bytes, but may vary according to the specific protocol implemented. A peer opcode (PEER) 530 is sent as response to a message comprising the ‘I am alive’ opcode (I_AM_ALIVE) 525. As before, the message size is preferably 8 bytes, but may vary. A digest opcode (DIGEST) 535 is used to indicate a message used for exchanging summary caches, described further below. In embodiments where ICP messaging is used for server discovery (sending I_AM_ALIVE, PEER, and/or DIGEST messages), servers within the cluster should be on a same network subnet. This requirement is removed when another messaging protocol is chosen, as is known in the art. A load opcode (LOAD) 540 is used to indicate a message sent periodically to inform other servers about the load on a first server, as discussed further below. Preferably, the maximum message size is 8 bytes. An asset insert opcode (ASSET_INSERT) 545 indicates a notification message sent to inform other servers that an asset has been installed on a first server. Preferably, the maximum message size is 20 bytes plus the length of the asset name plus the length of the server name that has had the asset installed. An asset delete opcode (ASSET_DELETE) 550 indicates a message sent out to inform other servers that an asset has been deleted on a first server. Preferably, the maximum message size is 20 bytes plus the length of the asset name plus the length of the server name from which the asset has been deleted. A node shutdown opcode (NODE_SHUTDOWN) 555 indicates a message sent to inform other servers if a node has been shut down—by an administrator or otherwise. Preferably, the message size is 4 bytes. A cluster shutdown opcode (CLUSTER_SHUTDOWN) 560 indicates a message sent if an entire cluster is shut down—by an administrator or otherwise. Preferably, the message size is 4 bytes. A load frequency change opcode (LOAD_FREQ_CHANGE) 565 indicates a message informing other servers that the load frequency has been altered. Load frequency is discussed further below. Some servers use this type of message to reset their failure detection alarms in addition to or instead of alerting themselves that the load frequency is altered. Preferably, the message size is 8 bytes. An ICP interface change opcode (ICP_IF_CHANGE) 570 indicates a message to a server that the bootstrap interface has been changed, and it needs to listen and send on the new interface. Preferably, the message size is 4 bytes. The opcodes above, including preferred uses for the opcodes and preferred sizes of the associated messages, are presented by way of example. However, it will be readily appreciated by those skilled in the art that any of a variety of opcodes may be designated for a particular message. Further, the above specific interfaces are presented by way of example, and it will be readily appreciated by those skilled in the art that a variety of specific interfaces may be chosen and implemented to achieve the above-described communication pathways.
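A hypothetical wire encoding of the 4-byte header described above may help fix ideas. The text specifies only the three header fields (opcode, version, data length) and the 4-byte total; the 1+1+2 byte field widths, network byte order, and opcode value below are assumptions introduced for illustration.

    import struct

    OP_LOAD = 0x0A  # illustrative opcode value; not specified in the text

    def pack_icp(opcode, version, data):
        # 4-byte header (opcode, version, data length) followed by data.
        return struct.pack("!BBH", opcode, version, len(data)) + data

    def unpack_icp(message):
        opcode, version, length = struct.unpack("!BBH", message[:4])
        return opcode, version, message[4:4 + length]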

A variety of variables are available for describing the state of the node agent. These variables can be set by an administrator, or other source, and may be present encoded within a server at startup, or default values may be assumed by the node agent. The default values may be set by an administrator, or other source. According to one embodiment, on startup, the node agent checks to see if a node agent table (NodeAgentTbl) exists in a local database. That is, a server within a cluster generally maintains a node agent table describing its configuration. In other embodiments, agent tables are shared.

An exemplary embodiment of a node agent table (NodeAgentTbl) is shown as Table 2, along with some exemplary default values. It is to be understood that all or a portion of the described fields may be present in various embodiments of the node agent table. Briefly, field Cluster Mode 600 is associated with mode value or condition 601, such as Standby, indicating what mode the node agent is in. In one embodiment, a node agent operates in one of two modes—standby and cluster. In standby mode, the node agent operates as a server that streams video. In standby mode, the node agent does not know of other servers in a cluster and does not forward any requests. On activation to cluster mode, the server automatically discovers other servers in the cluster and will load balance play requests, as described further below.

Threshold value field 610, associated with threshold value 611, such as a value of 70, is an optional but advantageous field and indicates a threshold load value. The determination of and use of this threshold value is discussed further below; briefly, this value indicates a load level above which a server will attempt to find another, less loaded, server in the cluster to service a request, even if the first server has access to the requested asset and has sufficient resources to stream the asset. Generally, and as discussed further below, threshold value 611 ranges from 0 to 100 (typically scaled to represent a load level between 0% and 100% of some nominal, predetermined, or maximum load), although in other embodiments other ranges are possible, depending on the method used to calculate threshold value 611. In a preferred embodiment, a load threshold value represents an indication of the load on a server including considerations of percent CPU used, available memory, and available network bandwidth. Other considerations are discussed further below.

In other embodiments, a plurality of threshold values are determined, each corresponding to a different server resource, and a plurality of threshold value fields appear in Table 2.
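The text names CPU, memory, and network bandwidth as inputs to the 0-100 load value but gives no combining formula; the weighted average below is one plausible, assumed combination, offered only as a sketch.

    def load_level(cpu_used_pct, memory_used_pct, network_used_pct,
                   weights=(0.4, 0.3, 0.3)):
        # Weighted average of the three resource utilizations, on a
        # 0-100 scale; the weights are assumptions, not from the text.
        usages = (cpu_used_pct, memory_used_pct, network_used_pct)
        return sum(w * u for w, u in zip(weights, usages))

    # load_level(80, 60, 50) returns 65.0; against a threshold value of
    # 70, such a server would still service requests itself.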

Bootstrap Interface field 620 is associated with a Bootstrap Interface 621, such as the first reported network interface.

Hot Object Count field 630 is associated with count value 631, for example, 60. Hot object counts are described further below. Hot Object Period field 640 is associated with hot object period value 641, such as 60 seconds. In preferred embodiments, the hot object period is represented in seconds and ranges from about 30 seconds to about 1800 seconds, although in some embodiments a longer or shorter time period will be used. Hot object periods are discussed further below. Briefly, if the number of requests for a first asset exceeds the hot object count during the hot object period (i.e. more than 60 requests in 60 seconds in this example), the asset is considered ‘hot’, and the server will attempt to copy the asset to another server which does not have direct access to the asset, in order to increase the capacity of the cluster to stream the asset. Hot object count 630 and hot object period 640 may be entered by an administrator and may vary according to the presumed relative popularity of an asset.

Additionally, a plurality of hot object count fields and hot object period fields may appear in Table 2, each corresponding to a certain asset or group of assets.
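The hot-asset rule just described lends itself to a compact sketch. The sliding-window data structure below is an assumption introduced for illustration; the text specifies only the rule (more than the hot object count of requests within the hot object period triggers replication).

    import time
    from collections import defaultdict, deque

    class HotAssetDetector:
        def __init__(self, hot_count=60, hot_period=60.0):
            self.hot_count = hot_count          # Hot Object Count (e.g. 60)
            self.hot_period = hot_period        # Hot Object Period, seconds
            self.history = defaultdict(deque)   # asset name -> request times

        def record_request(self, asset, now=None):
            now = time.time() if now is None else now
            times = self.history[asset]
            times.append(now)
            # Drop requests that fall outside the hot object period.
            while times and times[0] < now - self.hot_period:
                times.popleft()
            # True means the asset is 'hot' and a copy (or prefix) should
            # be replicated to a server that lacks the asset.
            return len(times) > self.hot_count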

Load Update Frequency field 650 is associated with a load update frequency 651, such as 5 seconds. Load update frequency 651 is discussed further below. Briefly, this indicates how often the server will broadcast load information about itself. Shorter periods increase the amount of messaging traffic between servers, while longer periods may result in a situation where other servers have outdated or inaccurate information about the first server's load.

Accordingly, on startup, if a node agent table (NodeAgentTbl) does not exist in a database, the node agent (nodeagent) for the server creates the table with default values; in one preferred embodiment, the values are as shown in Table 2. If the node agent table exists, the node agent reads the values from the table and starts itself in the appropriate mode, given by Cluster Mode 601.
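A minimal sketch of this startup check follows, using the default values of Table 2. The dict-like database object and key names are illustrative assumptions, not the described system's schema.

    DEFAULTS = {
        "Cluster Mode": "Standby",
        "Threshold Value": 70,
        "Bootstrap Interface": "first reported network interface",
        "Hot Object Count": 60,
        "Hot Object Period": 60,       # seconds
        "Load Update Frequency": 5,    # seconds
    }

    def start_node_agent(database):
        table = database.get("NodeAgentTbl")
        if table is None:
            table = dict(DEFAULTS)             # create table with defaults
            database["NodeAgentTbl"] = table
        return table                           # agent then starts in
                                               # table["Cluster Mode"]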

TABLE 1
Exemplary Opcodes and their uses.

Opcode (reference #)     Use
I_AM_ALIVE (525)         Bootstrap message. Sent to inform other servers that a server is up. Message size is 8 bytes.
PEER (530)               Sent as response to an I_AM_ALIVE message. Message size is 8 bytes.
DIGEST (535)             Message is used for exchanging Summary Caches.
LOAD (540)               Periodic message sent out to inform other servers about load on server. Max size is 8 bytes.
ASSET_INSERT (545)       Notification message sent out to inform other servers that an asset has been installed on server. Max message size is 20 bytes + length of asset name + length of server name.
ASSET_DELETE (550)       Notification message sent out to inform other servers that an asset has been deleted on server. Max message size is 20 bytes + length of asset name + length of server name.
NODE_SHUTDOWN (555)      Notification message sent out to inform others if a node has been administratively shut down. Message size is 4 bytes.
CLUSTER_SHUTDOWN (560)   Notification message sent out if an administrator decides to shut down entire cluster. Message size is 4 bytes.
LOAD_FREQ_CHANGE (565)   Notification message to inform other servers that the load frequency has been altered. Other servers use this message to reset their failure detection alarms. Message size is 8 bytes.
ICP_IF_CHANGE (570)      Notification message to server that the bootstrap interface has been changed and it needs to listen and send on the new interface. Message size is 4 bytes.

TABLE 2
NodeAgentTbl exemplary fields and values

Field                        Exemplary Value
Cluster Mode (600)           Standby (601)
Threshold Value (610)        70 (611)
Bootstrap Interface (620)    First reported Network Interface (621)
Hot Object Count (630)       60 (631)
Hot Object Period (640)      60 seconds (641)
Load Update Frequency (650)  5 seconds (651)

Additionally, values in the node agent table—including hot object count, hot object period, load update frequency, and load threshold value—may be dynamically updated during operation of the node agent, either upon request by a system administrator or other source, or automatically by the node agent in response to operating conditions. In a preferred embodiment, a system administrator is able to change one or more of the hot object count, hot object period, and threshold value using the cluster management console, described further below.

In preferred embodiments, on a cold start, that is, where the server is configured for the first time, the node agent comes up in Standby mode. In this mode, the server can be monitored and administered, but it is not a member of a cluster—that is, it does not communicate or exchange load or asset information with other servers. The node agent can be activated to the Cluster mode by an administrator either directly at the server comprising the node agent, or remotely through a console. Activation is the process by which a node agent becomes part of a cluster. By ‘part of a cluster’ herein is meant generally that a server communicates—that is, sends and receives messages—with other servers. The collection of servers sending and receiving each other's messages is generally referred to as a cluster.

An embodiment of the activation process is shown schematically in FIG. 7. Briefly, FIG. 7 depicts three servers in a cluster: server 700, server 710, and server 720. Server 700 is in the process of activation. The three servers are in communication through communication links or other means discussed above. Arrows and connections shown in FIG. 7 are intended to show the flow of information and are not intended to indicate physical or separate connections between servers. On activation, the node agent associated with server 700 broadcasts, step 730, an I am alive (I_AM_ALIVE) 525 message. In preferred embodiments, the I am alive (I_AM_ALIVE) 525 message is sent on port 9090. The message is received by servers 710 and 720, as well as any other servers in the cluster (not shown). Other servers that are up, including servers 710 and 720, respond with a digest message, step 740, such as a message using digest opcode (DIGEST) 535. Once a server, such as server 710 or 720, has retrieved the digest, the server sends out, step 750, a peer (PEER) message, using the PEER opcode 530, to build its cluster membership list. On getting this message, server 700 invokes a digest request, step 760, for example using the digest (DIGEST) opcode 535, on the server having sent the peer (PEER) message (such as server 710 or 720). Server 700 is operationally ready once the bootstrap phase is over. It then broadcasts load information, step 770, for example using the load (LOAD) opcode 540, to servers in the cluster periodically, as dictated by load frequency field 650.
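The first step of this exchange can be sketched as follows. UDP port 9090 is taken from the text; the message body, the pack_icp() helper from the earlier header sketch, and the opcode argument are illustrative assumptions.

    import socket

    def broadcast_i_am_alive(pack_icp, op_i_am_alive, version=1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        # 4-byte header plus 4 bytes of data gives the 8-byte message
        # size noted for I_AM_ALIVE above.
        message = pack_icp(op_i_am_alive, version, b"\x00\x00\x00\x00")
        sock.sendto(message, ("<broadcast>", 9090))  # port per the text
        sock.close()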

Accordingly, servers in a cluster maintain a list of assets that are available in the cluster and where they reside (generally by sending and receiving digest messages, updates, and asset insert or asset delete notifications). Generally, every streaming server within the cluster maintains an asset list; in some embodiments, only a subset of servers maintain an asset list; and in one embodiment, one server maintains an asset list. In some embodiments, therefore, the node agent caches a local asset directory of the assets that are available on the local server and also keeps an asset directory associated with each server in the cluster. The local directory is communicated to the rest of the servers during the activation phase, summarized above and in FIG. 7. When a server receives a request that it cannot service, or in some embodiments, when its load is greater than a threshold value, it consults these directories to select a server to which to forward the request.

The asset directories are advantageously compact and allow fast lookups, inserts, and deletes. Accordingly, in preferred embodiments, asset directories are implemented as a Summary Cache, as known in the art and described further in, for example, “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol”, L. Fan, P. Cao, J. Almeida, and A. Broder, IEEE/ACM Transactions on Networking 8(3): 281-293 (2000), hereby incorporated by reference herein. It will be readily appreciated by those skilled in the art that other structures could be employed to maintain an asset list at a server. Briefly, a Summary Cache represents a set of n elements as a bit vector of size n×m, where m is referred to as the Bloom Load Factor. A set of hash functions that map into this range is chosen to support insertion, deletion, and membership queries. In a preferred embodiment, the node agent implements a Summary Cache with a Bloom Load Factor of 16 and 4 hash functions. However, a Bloom Load Factor generally between 8 and 64 and between 2 and 8 hash functions can be used, although in some embodiments a greater or lesser number of either may be advantageous. The choice of the Bloom Load Factor and the number of hash functions is influenced by the acceptable probability of a false hit. A false hit occurs when the summary cache responds to a membership query by saying that an element exists when in reality it does not. For a Bloom Load Factor of 16 and 4 hash functions, the probability of a false hit is approximately a quarter of one percent. In some embodiments, the hash functions are built by first calculating the MD5 signature of the asset name, as known in the art. Recall that an MD5 signature hashes an arbitrary length string into a fixed length signature. In other embodiments, the hash functions are built by calculating the MD5 signature of some other string uniquely associated with the asset.
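The quarter-percent figure agrees with the standard Bloom filter false-positive estimate (assuming independent, uniformly distributed hash functions), where k is the number of hash functions and m is the Bloom Load Factor in bits per element:

    P_{\text{false hit}} \approx \left(1 - e^{-k/m}\right)^{k}
                         = \left(1 - e^{-4/16}\right)^{4}
                         \approx (0.2212)^{4} \approx 0.0024

that is, about a quarter of one percent.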

One embodiment of a method for calculating indices in a Summary Cache is shown in FIG. 8. The MD5 signature of asset name 800 is calculated in step 810. The MD5 signature hashes an arbitrary length string into a 128-bit signature 820. In other embodiments, signature 820 is longer or shorter than 128 bits. The signature is then divided into four 32-bit integers (integers 822, 824, 826, and 828), each of which is reduced modulo n×m in step 830. Integers 822, 824, 826, and 828 are used as the four hashes. That is, a ‘1’ is entered in the positions of summary cache 840 corresponding to locations given by integers 822, 824, 826, and 828. In preferred embodiments, the maximum number of assets on a server is set as a command line parameter. In a preferred embodiment, the maximum number of assets on a server is 1000, and the Summary Cache size is accordingly 1000×16, or 16,000 bits. Accordingly, integers 822, 824, 826, and 828 in FIG. 8 are between 0 and 15,999. The size of the Summary Cache, and accordingly the modulo number used in step 830 and the range of integer values for integers 822, 824, 826, and 828, will vary according to the number of assets on a server and the length of the signatures.
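A short Python sketch of this index calculation, under the preferred-embodiment parameters (1000 assets, Bloom Load Factor 16), is given below; the byte order used to split the signature is an assumption.

    import hashlib
    import struct

    CACHE_BITS = 1000 * 16   # max assets x Bloom Load Factor = 16,000 bits

    def summary_cache_indices(asset_name: str) -> list[int]:
        signature = hashlib.md5(asset_name.encode()).digest()  # step 810: 128 bits
        words = struct.unpack(">4I", signature)                # four 32-bit integers
        return [w % CACHE_BITS for w in words]                 # step 830: 0..15999

    def insert(cache: bytearray, asset_name: str) -> None:
        for i in summary_cache_indices(asset_name):
            cache[i // 8] |= 1 << (i % 8)    # set bit i of summary cache 840

    cache = bytearray(CACHE_BITS // 8)       # 2000 bytes = 16,000 bits
    insert(cache, "example_movie.mpg")       # hypothetical asset name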

Assets that are installed or deleted once the cluster is operational generate notifications to the node agent. The node agent in turn communicates this information using asset insert (ASSET_INSERT) 545 or asset delete (ASSET_DELETE) 550 messages to the rest of the servers in the cluster. These messages broadcast the indices of the Summary Cache bits that need to be altered as a result of the installation or deletion of an asset.
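Reusing the helpers from the sketches above, an install notification might be propagated as follows; the wire format is, again, an assumption.

    def on_asset_installed(sock, my_name: bytes, asset_name: str) -> None:
        # Compute the local Summary Cache bit indices and broadcast them so
        # peers can update their copy of this server's asset directory.
        idx = summary_cache_indices(asset_name)
        payload = ",".join(map(str, idx)).encode()
        sock.sendto(b"ASSET_INSERT:" + my_name + b":" + payload,
                    ("<broadcast>", ICP_PORT))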

As discussed briefly above, each server in a cluster calculates one or a plurality of factors associated with its load and broadcasts one or more load factors, or metrics, to other servers in the cluster. That is, each server periodically (or according to some other scheme or policy) extracts a load metric or metrics, computes a load factor or factors, and broadcasts this information to servers in the cluster. Load metrics may include, for example, any one or combination of CPU idle time, CPU utilization, amount of free physical and swap memory, and network bandwidth utilized or available network bandwidth, or other load related metrics or measures. Each of these metrics may be converted into a load factor through any variety of scaling and normalization procedures. In one embodiment, a network bandwidth metric is calculated by determining the number of streams in use out of a known number of available streams. In a preferred embodiment, each metric is represented as a percentage, and a plurality of metrics are summed and normalized to a number, an overall load factor, between 0 and 100 that reflects the overall load on the server. In some embodiments, a plurality of metrics are combined in a weighted sum; the same or different weightings may be applied to the different metrics so that their relative importance in the overall metric is accounted for. In some embodiments, higher numbers indicate greater loads; in other embodiments, lower numbers indicate greater loads. In other embodiments, a plurality of load factors are calculated, each for a different load metric or combination of metrics. Load information, comprising one or more load factors, is broadcast to other servers using a load message, such as a message using the load (LOAD) opcode 540, or other like message protocol.
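As one possible reading of the preferred embodiment, the overall load factor might be computed as a normalized weighted sum of percentage metrics; the particular metrics and weights below are illustrative only.

    def overall_load_factor(cpu_util_pct: float,
                            mem_used_pct: float,
                            streams_in_use: int,
                            streams_available: int,
                            weights=(0.4, 0.2, 0.4)) -> float:
        # Network metric: fraction of the known stream capacity in use.
        network_pct = 100.0 * streams_in_use / streams_available
        metrics = (cpu_util_pct, mem_used_pct, network_pct)
        total = sum(w * m for w, m in zip(weights, metrics))
        return total / sum(weights)   # 0..100; higher = more heavily loaded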

Each server within a cluster is further configured, through the program module node agent, to provide request forwarding. That is, on receiving a request for an asset, a first server checks to see if any of the following conditions is true: (1) the asset does not exist on the first server, or is not associated with the first server, that is, the first server does not have metadata associated with the asset, a prefix associated with the asset, or the asset itself residing on its direct storage, as appropriate with regard to the particular server configuration; (2) sufficient resources do not exist to stream the asset on or from the first server; or (3) the current load on the first server is over a threshold limit, that is, a specified load factor exceeds a threshold limit, as discussed above. In some embodiments, the first server only checks whether the asset does not exist on the first server and whether sufficient resources do not exist to stream the asset on the first server; a load threshold value is not checked. If any of these conditions is true, the server attempts to locate a second server in the cluster that has the asset and sufficient resources to stream the asset.

In a case where the server has the asset and the resources, but has a load factor exceeding a threshold limit, it will attempt to find another server that is less loaded (that is, has a load factor corresponding to a load less than that of the first server) and that has the asset. If it fails to locate another server, it will service the request. In some embodiments, the first server has a smaller overall load factor than a second server, but a greater load factor for a critical metric. That is, in some embodiments, a first server will attempt to forward a request if a single load factor is greater than a threshold value corresponding to that load factor. In preferred embodiments, the first server attempts to forward the request when its overall load factor is greater than a threshold value.

Accordingly, servers within clusters according to the present inventionmay advantageously but optionally have a load thresholding feature. Asdiscussed briefly above, a load threshold is a number corresponding to athreshold level for a load factor, discussed above. The load thresholdrepresents the load factor level beyond which the server will consultthe node agent to determine if there is a server that is less loadedthan itself that would be able to service the request. In preferredembodiments, the load threshold value is a number between 0 and 100 andcorresponds to the threshold level of an overall load factor, discussedabove, representing a plurality of load metrics. In preferredembodiments, a load threshold value of between 20 and 50 is used. Insome embodiments, a plurality of load threshold values are providedcorresponding to a plurality of load factors and the first serverattempts to locate a second, less loaded server when a predeterminednumber of load threshold values are exceeded. Accordingly, whileoperating over the load threshold, the cluster software, or programmodule, or node agent, adds a small overhead to the play requestprocessing, as it has to determine the most appropriate server in thecluster to service the request. In other embodiments, load thresholdingis not provided by the node agent. In still other embodiments, differentload assessment and/or allocation techniques or procedures may beapplied.

The load (LOAD) messages may advantageously double as heartbeats that are used for failure detection in some embodiments. That is, each server under normal operating conditions broadcasts load information, for example using a LOAD message, at regular intervals given, for example, by load update frequency field 650, or according to some other scheme or policy. In some embodiments, timers are programmed to trigger events in the case where there has been no communication between a pair of nodes for a certain length of time. The triggered event verifies whether a server is out of service or is merely slow in responding. If a first server detects that a second server is down, it marks the second server as down and removes it from membership of the cluster. When it receives an I am alive (I_AM_ALIVE) 525 message from the server that went down, it includes that server back in the cluster.
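A sketch of this heartbeat-based failure detection follows; the grace factor and the shape of the verification step are assumptions.

    import time

    LOAD_UPDATE_FREQ_S = 5   # field 650 of Table 2
    GRACE_FACTOR = 3         # assumed multiple before a peer is suspected down

    last_heard: dict[str, float] = {}   # server name -> last message time

    def on_message(server: str) -> None:
        last_heard[server] = time.monotonic()   # any LOAD message is a heartbeat

    def check_peers(members: set[str]) -> None:
        now = time.monotonic()
        for server in list(members):
            silent = now - last_heard.get(server, now)
            if silent > GRACE_FACTOR * LOAD_UPDATE_FREQ_S:
                # Verify whether the server is down or merely slow; if down,
                # drop it from the cluster until a fresh I_AM_ALIVE arrives.
                members.discard(server)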

FIG. 9 provides a schematic overview of one embodiment of a request forwarding procedure for a server in a cluster. A request for an asset is received in step 850. For example, referring back to FIG. 3, server 115 may receive a request for an asset in step 850. The following method continues to be discussed with reference to the cluster configuration shown in FIG. 3; however, it is to be understood that the method is applicable to all cluster configurations described above. Server 115 determines, step 852, if it has the asset; that is, in embodiments using a configuration such as that in FIG. 3, server 115 determines if asset information associated with the requested asset is stored on storage device 118. In step 854, the server (such as server 115) determines if it has sufficient resources to stream the asset. In other embodiments, the decisions are made in a different order. If the server either does not have the requested asset or does not have sufficient resources to stream the asset, the server (such as server 115) will attempt to forward the request (step 856) to a second server (such as server 110) that does have the asset and sufficient resources to stream. In some embodiments, if the server has the requested asset and sufficient resources to stream, the server will simply stream the asset (step 858). In other embodiments, the server then determines if its load is less than a threshold value, step 860, as discussed above, and streams the asset (step 858) if the load is sufficiently light. If the load exceeds a threshold value, then the server attempts to find a second server having the asset, sufficient resources to stream, and a lighter load, step 862. If the server finds such a second server, it forwards the request (step 864); if not, the first server will stream the asset (step 858). In other embodiments, the first server gives preference in step 862 to servers having the complete asset rather than servers having a prefix or other portion of the asset.
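The decision flow of FIG. 9 might be summarized as below; the helper predicates (has_asset, has_resources, find_server, and so on) are hypothetical stand-ins for the node agent's directory and resource checks.

    def handle_request(server, asset: str):
        # Steps 852/854: no local copy, or insufficient resources -> forward.
        if not server.has_asset(asset) or not server.has_resources(asset):
            target = server.find_server(asset, need_resources=True)   # step 856
            return server.forward(target, asset) if target else server.fail(asset)
        # Step 860: load under the threshold -> stream locally (step 858).
        if server.load_factor() < server.threshold:
            return server.stream(asset)
        # Step 862: over threshold; prefer a less loaded server with the asset.
        target = server.find_server(asset, need_resources=True,
                                    max_load=server.load_factor())
        # Step 864 if one is found; otherwise stream locally anyway (step 858).
        return server.forward(target, asset) if target else server.stream(asset)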

The request forwarding capabilities provided by the node agent, described above and in FIG. 9, allow load-balancing components, such as a Layer 4 switch, to optionally be eliminated. That is, in preferred embodiments, a load-balancing component is not present to direct user requests to a particular server within a cluster. Instead, user requests may enter the cluster at one or a plurality of servers, and the individual servers themselves forward the requests as necessary. In other embodiments, a load-balancing component, such as a Layer 4 switch, is utilized to distribute requests.

FIG. 10 depicts a schematic overview of a control flow when a play request is forwarded. FIG. 10 depicts an embodiment using RTSP (Real Time Streaming Protocol) request forwarding. It will be understood by those skilled in the art after reading this specification that other protocols may be used. Referring to FIG. 10, server 900 in a cluster receives an RTSP Setup call, step 902, and decides, according to one or more of the criteria above, to load balance by forwarding the request. Therefore, server 900 responds with an RTSP Multiple Choices message (step 904). Included with the RTSP Multiple Choices message is the name of an alternate server in the cluster that, in one embodiment, is the least loaded server that has the requested asset. Client 910 now makes an RTSP Setup call (step 912) to the new server, such as server 914. On a successful setup, server 914 responds with an RTSP OK message (step 916). Client 910 can now play the asset (step 918) from the second server.
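One way the redirect of step 904 could be composed is sketched below; the use of a Location header to carry the alternate server is an assumption layered on the patent's description of the RTSP Multiple Choices response.

    def redirect_setup(cseq: int, alternate_host: str, asset_path: str) -> str:
        # Respond to an RTSP SETUP with 300 Multiple Choices naming the
        # least loaded alternate server that has the requested asset.
        return ("RTSP/1.0 300 Multiple Choices\r\n"
                f"CSeq: {cseq}\r\n"
                f"Location: rtsp://{alternate_host}{asset_path}\r\n"
                "\r\n")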

The present invention further advantageously but optionally provides methods, procedures, and computer programs and computer program products for hot or high demand asset load balancing. Briefly, an asset (such as a feature video movie or motion picture) is said to be hot when usage statistics indicate a spike or other high-demand condition in the number of requests for that asset. Generally, a spike means a flurry of requests in a short period of time. Accordingly, servers in clusters of the present invention are configured to provide a hot asset trigger through the computer program module, or node agent, installed therein. In a preferred embodiment, the hot asset trigger, represented by hot asset count 630 and hot asset period 640, is set by an administrator. In other embodiments, hot asset count 630 and hot asset period 640 are dynamically selected and/or updated by the node agent, or by the server itself. The trigger is fired or released when the number of requests for an asset within hot asset period 640 exceeds hot asset count 630. In other embodiments, the trigger is fired when the number of requests for an asset within hot asset period 640 equals or exceeds hot asset count 630. Once the trigger is fired, the node agent will replicate the asset to the least loaded server in the cluster (or some other server in the cluster that has capacity to serve according to some scheme or policy) that does not have that asset. In some embodiments, a service wrapper, a video transfer service, is provided that supplies video content delivery functionality. This wrapper, or video transfer service, provides a computer program module containing instructions to replicate an asset.
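A sliding-window counter is one straightforward way to realize this trigger; the deque-based window below is an implementation assumption.

    import time
    from collections import defaultdict, deque

    HOT_COUNT = 60       # hot asset count 630 (exemplary value)
    HOT_PERIOD_S = 60    # hot asset period 640 (exemplary value)

    request_times = defaultdict(deque)   # asset name -> request timestamps

    def record_request(asset: str) -> bool:
        """Return True when the hot asset trigger fires for this asset."""
        now = time.monotonic()
        window = request_times[asset]
        window.append(now)
        while window and now - window[0] > HOT_PERIOD_S:
            window.popleft()              # drop requests outside the period
        return len(window) > HOT_COUNT    # fire: replicate to another server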

In some embodiments, the entire asset is not replicated to another server once the asset is considered ‘hot’; rather, a variable length prefix of the asset is replicated to another server. Generally, a prefix of an asset comprises between 5 and 50 percent of the asset, although in some embodiments a larger or smaller prefix may be transferred. This is referred to as prefix caching. Embodiments of prefix caching for media objects are described in copending U.S. patent application Ser. No. 09/774,204 filed 29 Jan. 2001 and entitled Prefix Caching for Media Objects, herein incorporated by reference.

When a second server having a variable length prefix of an asset receives a request for that asset and conditions are suitable for the second server to service that request, it begins playout of the prefix and requests transfer of the entire asset from a server having the asset, or from a centralized storage location, depending on the configuration of the cluster. In still other embodiments, the entire asset is not replicated to another server once the asset is considered ‘hot’; rather, metadata associated with the asset is replicated to another server, and that server requests a copy of the entire asset upon receiving a serviceable request.
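A sketch of servicing a request from a cached prefix follows; fetch_remainder and playout are hypothetical helpers, and adequate buffering to preserve the required QOS is assumed.

    import threading

    def play_from_prefix(server, asset: str) -> None:
        # Locate a server (or central store) holding the complete asset.
        source = server.find_server(asset, complete_only=True)
        # Fetch the remainder in the background while the prefix plays out.
        threading.Thread(target=server.fetch_remainder,
                         args=(source, asset), daemon=True).start()
        server.playout(asset)   # begins immediately from the cached prefix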

A Cluster Management Console may be provided to allow an administrator to effectively manage a cluster. The Cluster Management Console is generally a centralized tool to define, configure, administer, and monitor the servers in a cluster. The Console collects server information, asset information, and load and stream counts, and presents the information or data in an easy-to-view format. An administrator can then use this information to move and replicate assets, add or remove servers, adjust parameters to keep the cluster running at ideal performance, and the like. Generally, then, the Cluster Management Console provides all or a subset of the following functionalities: defining a cluster; adding and/or removing servers from a cluster; activating and/or deactivating servers in a cluster; configuring cluster parameters; displaying server information, cluster configuration, asset listings, SNMP events, and the like; displaying system errors, warnings, and the like by enabling SNMP traps; monitoring server load, active stream counts, asset requests, and hot objects; administering a server using the administrative web graphical user interface; logging in to a cluster; and playing out, transferring, listing locations of replicated assets, renaming, and deleting assets.

In some embodiments, clusters of the present invention are configured to support a single sign-on feature. That is, when servers in a cluster are operating with A4 services (Authentication, Authorization, Access Control and Accounting) enabled, each server is a secure server and only those authorized by a successful login are able to play out assets; it may therefore become inconvenient for an administrator to have to log on to each of the servers separately. The Single Sign On feature allows an administrator to log on once to a cluster using a Cluster Management Console and administer any of the servers in the cluster without having to log on separately. Once an administrator logs on to a cluster using the Console, the user credentials are passed along with any administrative or play requests. The Console can be implemented as a program module having a variety of formats, such as, for example, a Java Applet. In some embodiments, the Console is installed on a server within a cluster. In other embodiments, the Cluster Management Console resides on a computer or other device having a processor and in communication with a server or servers in the cluster.

One embodiment of the graphical appearance of console view 1000 is shown in FIG. 11. Console view 1000 comprises Cluster View pane 1005, Server View pane 1010, and Message pane 1015. Cluster View pane 1005 is used to define new clusters, add and delete servers in a cluster, and browse different clusters and the servers that are part of each cluster. For example, Cluster View pane 1005 shows three clusters: a first cluster (L4Cluster), a second cluster (QeCluster), and a third cluster (Jglue), where ‘L4Cluster’, ‘QeCluster’, and ‘Jglue’ represent arbitrary cluster names.

Servers in any or each cluster can be viewed. For example, ‘QeCluster’ comprises servers ‘glimmer’, ‘gelato’, ‘qalinux3’, and ‘rigel’ in FIG. 11, where ‘glimmer’, ‘gelato’, ‘qalinux3’, and ‘rigel’ are names assigned to the particular servers. Server View pane 1010 provides detailed information about servers and buttons for monitoring various cluster-wide data. For example, Server View pane 1010 has buttons to view general information, monitor information, and asset catalog information. As shown, Server View pane 1010 displays asset catalog information including assets contained in cluster ‘QeCluster’. Message pane 1015 is used for informational messages and for notification of warnings or critical events. As shown, Message pane 1015 displays several messages, including that ‘gelato’ in ‘QeCluster’ was restarted. The date and time of the messages may also be shown.

The Console can be used to view multiple clusters, as shown in FIG. 11. In preferred embodiments, an administrator defining a cluster would create views in console 1000 that reflect physical clusters, as described above. In other embodiments, views in console 1000 do not reflect physical clusters.

Critical errors, warnings, asynchronous event notifications (hot object transfer completion, for example), and the like are reported back to the Cluster Management Console as SNMP traps. An administrator using the Console is accordingly informed about such events on any server in a cluster and, if needed, can then take appropriate action. FIG. 12 schematically depicts one embodiment of how asynchronous events (traps) are propagated to the Cluster Console. These traps are errors or warnings that are generated in the cluster, that may require immediate attention, or that meet some other criteria. Console 1300 registers (step 1302) with SNMP Service 1305. SNMP Service 1305 implements the Simple Network Management Protocol (SNMP) and acts as a central clearing house for traps. Services 1308, such as a node agent or other computer program module, or a plurality of services, generate traps that are sent (step 1310) to SNMP Service 1305. On receiving a trap, SNMP Service 1305 forwards the trap to Console 1300, which then displays the message. Additionally, in some embodiments, services 1308 send, step 1318, error or warning messages to log 1320. Log 1320 may further send traps, step 1322, to SNMP Service 1305.

The Cluster Management Console further allows for monitoring of server load, cluster-wide active stream counts, and asset popularity. In addition, playout status, disk status, network status, and the like can be monitored if the appropriate SNMP agent is running on the desired servers. FIG. 13 depicts one embodiment of a Load Monitor displayable by the Console. Load Monitor 1100 displays four load graphs 1101, 1102, 1103, and 1104, each corresponding to a different server. Y-axis 1105 represents a load factor, discussed above, and X-axis 1110 represents time. Bar scale 1115 gives another depiction of load level. FIG. 14 depicts one embodiment of a Stream Monitor displayable on the Console. Stream Monitor 1200 depicts information associated with four servers (1201, 1202, 1203, and 1204) in an additive manner such that the total number of streams playing can also be viewed. Y-axis 1210 represents number of streams, while X-axis 1215 represents time.

Clusters according to the present invention further maintain counters that allow an administrator to view or ascertain the operational health of the cluster. Generally, each server maintains some or all of the counters described below. In other embodiments, counters are shared. In some embodiments, counter information is aggregated and displayed by the Cluster Management Console, described above, which contacts each of the servers in the cluster. Exemplary counters, all or some of which may be implemented in a particular cluster, are:

(1) an asset not cached counter (AssetNotCached) that is incremented when a server receives a request for an asset that is not installed locally;

(2) an asset not in cluster counter (AssetNotInCluster) that is incremented when a server receives a request for an asset that is not installed locally and also is unable to find it anywhere in the cluster;

(3) a resources unavailable counter (ResourcesUnavailable) that is incremented when a server receives a play request for an asset that is installed locally, but the server does not have the resources to play the request;

(4) a first try counter (FirstTry) that is incremented when a server looks for an alternate server to service a play request and finds one on the first try;

(5) a second try counter (SecondTry) that is analogous to the first try counter (FirstTry), but in this case it takes two attempts to find an alternate server to service the request (if this counter is rapidly increasing, one possibility is that load information is not being exchanged frequently enough);

(6) a three or more counter (ThreeOrMore) that is incremented when it takes more than two attempts to service a request (this counter may further indicate a need to change the load update frequency);

(7) an out of cluster resources counter (OutOfClusterResources) that is incremented when a server receives a request for an asset that it cannot service and also finds that no other server in the cluster can service the request (this counter may indicate the cluster is operating at peak capacity, and more servers may need to be added to the cluster if this counter is rapidly increasing);

(8) an ICP messages counter (ICPMessages) that is incremented when a server receives an ICP_QUERY message from a cache, inquiring about the presence of an asset;

(9) an ICP hits counter (IcpHits) that is incremented when a server responds to an ICP_QUERY message with an ICP_HIT message (the server responds with an ICP_HIT message when the requested asset is present in the cluster);

(10) an asset inserts counter (AssetInserts) that is incremented when an asset is installed at the server;

(11) an asset deletes counter (AssetDeletes) that is incremented when an asset is deleted from a server; and

(12) a false hits counter (FalseHits) that is incremented when a server receives a request to play an asset from another server in the cluster but the receiver does not have the requested asset (false hits lead to more messages and increase response times).
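For illustration, the counters might be kept as a simple per-server record; the names follow the text, while the structure itself is an assumption.

    from dataclasses import dataclass

    @dataclass
    class HealthCounters:
        asset_not_cached: int = 0          # AssetNotCached
        asset_not_in_cluster: int = 0      # AssetNotInCluster
        resources_unavailable: int = 0     # ResourcesUnavailable
        first_try: int = 0                 # FirstTry
        second_try: int = 0                # SecondTry
        three_or_more: int = 0             # ThreeOrMore
        out_of_cluster_resources: int = 0  # OutOfClusterResources
        icp_messages: int = 0              # ICPMessages
        icp_hits: int = 0                  # IcpHits
        asset_inserts: int = 0             # AssetInserts
        asset_deletes: int = 0             # AssetDeletes
        false_hits: int = 0                # FalseHits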

The invention may advantageously implement the methods and procedures described herein on a general purpose or special purpose computing device, such as a device having a processor for executing computer program code instructions and a memory coupled to the processor for storing data and/or commands. It will be appreciated that the computing device may be a single computer or a plurality of networked computers and that the several procedures associated with implementing the methods and procedures described herein may be implemented on one or a plurality of computing devices. In some embodiments, the inventive procedures and methods are implemented on standard server-client network infrastructures with the inventive features added on top of such infrastructure or compatible therewith.

The invention also provides a business model and method for distribution of content and assets (such as video movies) as well as a business model and method for operating and growing a scalable content and asset distribution system.

In one embodiment, the invention provides a business model for operating a time-base accurate asset streaming business including: operating a first server to receive and service requests for an asset, the first server (i) receiving a request for an asset, (ii) determining if the first server has the asset available for time-base accurate streaming and has sufficient resources to time-base accurately stream the asset, and (iii) time-base accurately streaming the asset if it is determined that the first server has the asset available and has sufficient resources; and, if the determining indicates that the first server does not have the asset available for time-base accurate streaming or does not have sufficient resources to time-base accurately stream the asset, then: (i) identifying a second server having the asset available for time-base accurate streaming and sufficient resources to time-base accurately stream the asset, and (ii) forwarding the request to the identified second server for servicing by the second server. This asset may, for example, comprise a multi-media asset such as, for example, a video movie or other asset type described herein.

Embodiments of the business model and method may include or utilize features of the inventive system, method, procedures, and computer program and computer program product described elsewhere herein and not separately described relative to the inventive business model and method.

The foregoing descriptions of specific embodiments and best mode of the present invention have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.

1. A server system for time-based media streaming comprising: a plurality of servers coupled for communication and including a first server; and a computer readable storage medium at said first server storing therein information associated with an asset that is replicated in a computer readable storage medium associated with at least one other server; said first server being configured to: (i) receive a request for said asset from an external client coupled to the server system; and (ii) determine if said asset is a hot asset, and if said asset is determined to be a hot asset, then copying information associated with said asset to a second server including copying at least one of: (a) a prefix of said asset to said second server, and (b) copying the asset to the second server and streaming the prefix of the asset from the second server.
2. A server system according to claim 1, wherein said asset comprises an audio or a video.
3. A server system according to claim 1, wherein said asset information comprises metadata associated with an asset.
4. A server system according to claim 1, wherein said first server stores a hot asset count value and a hot asset period and is further configured to keep track of received requests for assets and to replicate part or all of the asset to another server in response to a determination that a number of received requests for the asset during a period equal in length to the hot asset period exceeds the hot asset count value.
5. A server system according to claim 1, wherein said first server is further configured to cause the request from the client to be sent to the second server by informing the client to send the request to the second server.
6. A method for time-based streaming of assets, said method comprising: receiving a request from a client for an asset at a first server; and determining if said asset is a hot asset, and if said asset is determined to be a hot asset, then copying information associated with said asset to a second server including copying at least one of: (a) a prefix of said asset to said second server, and (b) copying the asset to the second server and streaming the prefix of the asset from the second server.
7. A computer program product for use in conjunction with a first server having at least one processor and a memory coupled to the processor, the first server being in communication with at least one second server, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising: a program module that directs the first server to function in a specified manner to provide for time-based streaming of assets upon receiving a request for an asset from an external client, the program module including instructions for: receiving a request from a client for an asset at a first server; and determining if said asset is a hot asset, and if said asset is determined to be a hot asset, then copying information associated with said asset to a second server including copying at least one of: (a) a prefix of said asset to said second server, and (b) copying the asset to the second server and streaming the prefix of the asset from the second server.
8. A method for operating a time-base accurate asset streaming business, said method comprising: operating a plurality of servers each configured to receive and service requests for assets from external clients of said business, said operating comprising: operating a first server to receive a request from a client for an asset at said first server; and operating said first server to determine if said asset is a hot asset, and if said asset is determined to be a hot asset, then copying information associated with said asset to a second server including copying at least one of: (a) a prefix of said asset to said second server, and (b) copying the asset to the second server and streaming the prefix of the asset from the second server.
9. A method for dynamically adjusting to content delivery service demand in a real-time system, the method comprising: detecting demand for a particular asset; and automatically and dynamically increasing a capacity for playing out a particular asset when demand for that asset increases.
10. A method as in claim 9, further including detecting demand for a plurality of different assets and automatically load-balancing said playing out said plurality of assets in response to said detected demands.
11. A system for dynamically adjusting to content delivery service demand in a real-time system, the system comprising: a plurality of servers coupled for communication; and at least a first one of said plurality of servers including a receiver for receiving a request for an asset from an external client, and a detector for detecting demand for a particular asset based on said received requests; and said plurality of servers being configured to automatically and dynamically increase a system capacity for playing out a particular asset when demand for that asset increases.